IR Evaluation Using Multiple Assessors per Topic
Authors
Abstract
Information retrieval test sets consist of three parts: documents, topics, and assessments. Assessments are time-consuming to generate; even using pooling, it took about 7 hours per topic to assess for INEX 2006. Traditionally the assessment of a single topic is performed by a single human. Herein we examine the consequences of using multiple assessors per topic. A set of 15 topics was used; the mean topic pool contained 98 documents. Between 3 and 5 separate assessors per topic assessed all documents in a pool. One assessor was designated the baseline. All were then used to generate 10,000 synthetic multi-assessor assessment sets. The baseline relative rank order of all runs submitted to the INEX 2006 relevant-in-context task was compared to those of the synthetic sets. The mean Spearman's rank correlation coefficient was 0.986 and all coefficients were above 0.95, a very strong correlation. Non-matching rank orders are seen when the mean average precision difference between runs is less than 0.05. Within the top 10 runs, no significantly different runs were ranked in a different order in more than 5% of the synthetic sets. Using multiple assessors per topic is very unlikely to affect the outcome of an evaluation forum.
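To make the procedure concrete, the following is a minimal sketch, in Python, of how synthetic multi-assessor assessment sets could be generated and compared against the baseline run ranking. This is not the authors' code: `judgments` (per-topic, per-assessor relevance labels) and `score_runs` (a function returning each run's mean average precision under a given assessment set) are hypothetical placeholders.

```python
import random
from scipy.stats import spearmanr

def synthetic_qrels(judgments, rng):
    """Build one synthetic assessment set by picking a random assessor for each topic.

    judgments: {topic: {assessor: {doc_id: relevance}}}  (hypothetical structure)
    """
    qrels = {}
    for topic, by_assessor in judgments.items():
        assessor = rng.choice(sorted(by_assessor))
        qrels[topic] = dict(by_assessor[assessor])
    return qrels

def rank_correlations(baseline_qrels, judgments, score_runs, n_sets=10_000, seed=0):
    """Spearman's rho between the baseline run ranking and each synthetic ranking."""
    rng = random.Random(seed)
    baseline = score_runs(baseline_qrels)            # {run_id: MAP}
    run_ids = sorted(baseline)
    base_scores = [baseline[r] for r in run_ids]
    rhos = []
    for _ in range(n_sets):
        scores = score_runs(synthetic_qrels(judgments, rng))
        rho, _ = spearmanr(base_scores, [scores[r] for r in run_ids])
        rhos.append(rho)
    return rhos
```

Under this reading, the mean of the 10,000 returned coefficients corresponds to the 0.986 figure reported in the abstract.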
Similar resources
The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and Students
This paper reports on a case study on the inter-assessor disagreements in the English NTCIR-13 We Want Web (WWW) collection. For each of our 50 topics, pooled documents were independently judged by three assessors: two “lancers” and one Waseda University student. A lancer is a worker hired through a Japanese part-time job matching website, where the hirer is required to rate the quality of the la...
Unanimity-Aware Gain for Highly Subjective Assessments
IR tasks have diversified: human assessments of items such as social media posts can be highly subjective, in which case it becomes necessary to hire many assessors per item to reflect their diverse views. For example, the value of a tweet for a given purpose may be judged by (say) ten assessors, and their ratings could be summed up to define its gain value for computing a graded-relevance evaluat...
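As a toy illustration of the summed-ratings gain this snippet describes (this is only the plain sum baseline, not the unanimity-aware variant the paper proposes), a graded-relevance metric such as DCG could be computed as follows; the data values are invented.

```python
import math

def summed_gain(ratings_per_item):
    """Gain of each ranked item = sum of its assessors' ratings (e.g. ten 0/1 votes per tweet)."""
    return [sum(ratings) for ratings in ratings_per_item]

def dcg(gains):
    """Discounted cumulated gain over a ranked list of gain values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

# Ten assessors judging three ranked tweets (invented data):
ratings = [[1] * 7 + [0] * 3, [1] * 4 + [0] * 6, [0] * 10]
print(dcg(summed_gain(ratings)))  # gains 7, 4, 0 -> DCG ≈ 9.52
```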
UMass Complex Interactive Question Answering (ciQA) 2007: Human Performance as Question Answerers
Every day, people widely use information retrieval (IR) systems to answer their questions. We utilized the TREC 2007 complex, interactive question answering (ciQA) track to measure the performance of humans using an interactive IR system to answer questions. Using our IR system, assessors searched for relevant documents and recorded answers to their questions. We submitted the assessors’ answer...
Discounted Cumulated Gain Based Evaluation of Multiple-Query IR Sessions
IR research has a strong tradition of laboratory evaluation of systems. Such research is based on test collections, pre-defined test topics, and standard evaluation metrics. While recent research has emphasized the user viewpoint by proposing user-based metrics and non-binary relevance assessments, the methods are insufficient for truly user-based evaluation. The common assumption of a single q...
Preference Judgments for Relevance
Information retrieval systems have traditionally been evaluated over absolute judgments of relevance: each document is judged for relevance on its own, independent of other documents that may be on topic. We hypothesize that preference judgments of the form “document A is more relevant than document B” are easier for assessors to make than absolute judgments, and provide evidence for our hypoth...
Publication date: 2007